Y. Xia, J. Sun Bioinformatic and Statistical Analysis of Microbiome Data https://doi.org/10.1007/978-3-031-21391-5_4

4. Building Feature Table and Feature Representative Sequences from Raw Reads

Yinglin Xia¹ and Jun Sun¹

(1)

Department of Medicine, University of Illinois Chicago, Chicago, IL, USA

Abstract

Bioinformatic techniques can now correct sequencing errors and determine real biological sequences at single-nucleotide resolution by generating amplicon sequence variants (ASVs) or sub-OTUs. QIIME 2 has wrapped the two most widely used denoising packages, DADA2 and Deblur, to generate ASVs and sub-OTUs with 100% identities to clinical variation. This chapter describes and illustrates their use. First, it introduces how to analyze demultiplexed paired-end FASTQ data. Then it introduces DADA2 and the q2-dada2 plugin and uses them to analyze demultiplexed and multiplexed paired-end FASTQ data. Next, it introduces Deblur and the q2-deblur plugin and uses them to analyze demultiplexed paired-end FASTQ data.

Keywords

Demultiplexed paired-end FASTQ data; Sample metadata; Raw sequence data; Qiime zipped artifacts (.qza); Multiplexed paired-end FASTQ data; Quality of the reads; q2-deblur plugin; list.files(); seqkit; Keemei

Traditionally, sequence reads are clustered into operational taxonomic units (OTUs) at a defined identity threshold so that sequencing errors do not generate spurious taxonomic units. However, as bioinformatic techniques have advanced, several software packages can now correct sequencing errors and determine real biological sequences at single-nucleotide resolution by generating amplicon sequence variants (ASVs) or sub-OTUs. Both ASVs and sub-OTUs are 100% OTUs and supposedly have 100% identities to clinical variation. The two most widely used denoising packages, DADA2 (Callahan et al. 2016b) and Deblur (Amir et al. 2017), have been wrapped into QIIME 2. In this chapter, we describe and illustrate their use to generate ASVs or sub-OTUs. First, we introduce how to analyze demultiplexed paired-end FASTQ data (Sect. 4.1). Then we introduce DADA2 and the q2-dada2 plugin and use them to analyze demultiplexed paired-end FASTQ data (Sect. 4.2) and multiplexed paired-end FASTQ data (Sect. 4.3). Next, we introduce Deblur and the q2-deblur plugin and use them to analyze demultiplexed paired-end FASTQ data (Sect. 4.4). Finally, we briefly summarize this chapter (Sect. 4.5).

4.1 Analyzing Demultiplexed Paired-End FASTQ Data

Example 4.1: MiSeq_SOP: One Sample Demultiplexed Paired-End FASTQ Data

The sample paired-end FASTQ data and metadata we analyze here are from a published paper by Schloss et al. (2012) entitled “Stabilization of the murine gut microbiome following weaning.” We introduced these data in Example 3.9. Here and in other chapters of this book, we use this dataset to illustrate bioinformatic analysis using QIIME 2 and statistical analysis using R. We downloaded the data from http://www.mothur.org/MiSeqDevelopmentData/StabilityNoMetaG.tar. You can download, unzip, and save this dataset into a directory on your computer, and then use the R function list.files() or the software seqkit to inspect the raw sequence data.

Below we extract a few rows to give you an idea of what these raw sequence data look like.
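The chapter itself relies on list.files() in R or on seqkit for this step. As a complementary illustration only, the following minimal Python sketch reads the first records of a gzipped FASTQ file, assuming the usual four-line record layout (header, sequence, "+" separator, quality string). The function name read_fastq and its limit argument are hypothetical and are not part of QIIME 2 or seqkit.

```python
import gzip
from itertools import islice


def read_fastq(path, limit=None):
    """Yield (header, sequence, quality) tuples from a gzipped FASTQ file.

    Assumes plain four-line records; `limit` caps the number of records.
    """
    def records():
        with gzip.open(path, "rt") as handle:
            while True:
                header = handle.readline().rstrip("\n")
                if not header:
                    return  # end of file reached
                sequence = handle.readline().rstrip("\n")
                handle.readline()  # skip the '+' separator line
                quality = handle.readline().rstrip("\n")
                yield header, sequence, quality

    return islice(records(), limit)
```

For example, list(read_fastq("MiSeq_SOP/F3D0_S188_L001_R1_001.fastq.gz", limit=2)) would return the first two forward reads of sample F3D0, provided the file has been saved at that relative path.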

4.1.1 Prepare Sample Metadata

Sample metadata contain important biological information for microbiome analysis. They are typically created to record technical details (e.g., the DNA barcodes) and descriptions of the experimental design and samples (e.g., the group, subject, time point, and body site that each sample belongs to). QIIME 2 places no specific restrictions on what types of sample metadata may be used and enforces no “metadata standards”; however, it does impose some general formatting requirements. For example, although files with any extension can be used, QIIME 2 prefers that sample metadata be stored in a tab-separated values (TSV) file rather than in other formats such as the commonly used comma-separated values (CSV) format. QIIME 2 uses TSV instead of CSV because CSV requires escaping commas that occur inside values, which often causes difficulties.

TSV files are simple text files used to store tabular data, such as a database table or spreadsheet data. The format is supported by many spreadsheet programs and databases. Thus, we can easily use a spreadsheet program such as Microsoft Excel or Google Sheets to edit and export our metadata files. This is also QIIME 2’s recommendation. In a TSV file, each row of the table is one line of the text file, and each column value of a row is separated from the next by a tab character. The TSV format is thus one type of the more general delimiter-separated values format and is an alternative to the CSV format. The interested reader can check the QIIME 2 documentation for details. Here, we just briefly describe the formatting requirements for QIIME 2 sample metadata files.
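To make the difference between the two formats concrete, the following minimal Python sketch writes the same two-row table once with commas and once with tabs, using the standard csv module. The column name and the description value are made up for illustration. A value that contains a comma has to be quoted in the CSV rendering, whereas the TSV rendering keeps it unchanged.

```python
import csv
import io


def render(rows, delimiter):
    """Serialize rows of strings as delimiter-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical metadata: the description contains a comma.
rows = [["sample-id", "description"], ["F3D0", "female 3, day 0"]]

csv_text = render(rows, ",")   # the description must be wrapped in quotes
tsv_text = render(rows, "\t")  # the description is written as-is
```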

Identifier Column. The identifier (ID) column is the first column in the metadata file, which defines the sample IDs for sample metadata. The ID column name (i.e., ID header) must be one of the following case-insensitive values, and they are not allowed to be used for naming other IDs or columns in the file: id, sampleid, sample id, sample-id, featureid, feature id, and feature-id.
IDs. IDs may consist of any Unicode characters, except that they may not start with the pound sign (#). A file needs at least one ID. IDs must be unique, cannot be empty, and cannot use any of the reserved ID column names listed above.
Identifiers. Identifiers should be no longer than 36 characters and should contain only ASCII alphanumeric characters (i.e., in the range of [a-z], [A-Z], or [0-9]), the period (.) character, or the dash (-) character.
Column Types. QIIME 2 currently supports both categorical and numeric metadata columns. By default, if one column consists only of numbers or missing data, then QIIME 2 will treat the type of metadata column as numeric. Otherwise, if the column contains any non-numeric values, QIIME 2 will treat the column as categorical. Both categorical columns and numeric columns support missing data (i.e., empty cells).
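The ID rules listed above can be checked mechanically. The following Python sketch is only an illustration of those rules as stated in this section; it is not QIIME 2's or Keemei's validator, and the function name check_ids is hypothetical.

```python
import re

# Reserved, case-insensitive ID column names listed in the text.
RESERVED_HEADERS = {"id", "sampleid", "sample id", "sample-id",
                    "featureid", "feature id", "feature-id"}

# Recommended identifier form: at most 36 ASCII letters, digits, '.' or '-'.
RECOMMENDED_ID = re.compile(r"[A-Za-z0-9.\-]{1,36}")


def check_ids(header, ids):
    """Return a list of problems; an empty list means none were found."""
    problems = []
    if header.lower() not in RESERVED_HEADERS:
        problems.append(f"invalid ID column name: {header!r}")
    if not ids:
        problems.append("at least one ID is required")
    seen = set()
    for identifier in ids:
        if identifier == "":
            problems.append("empty ID")
        elif identifier.startswith("#"):
            problems.append(f"ID starts with '#': {identifier!r}")
        elif identifier.lower() in RESERVED_HEADERS:
            problems.append(f"ID uses a reserved name: {identifier!r}")
        elif not RECOMMENDED_ID.fullmatch(identifier):
            problems.append(f"ID outside recommended form: {identifier!r}")
        if identifier in seen:
            problems.append(f"duplicate ID: {identifier!r}")
        seen.add(identifier)
    return problems
```

Calling check_ids("sample-id", ["F3D0", "F3D1"]) returns an empty list, while a duplicated ID or an ID that starts with # is reported as a problem.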

The SampleMetadataMiSeq_SOP.tsv was prepared based on SampleMetadata.xlsx from the published paper. We can prepare this sample metadata by taking the following steps.

Step 1: Collect the study design and sample information using an Excel sheet.
Step 2: Upload the Excel sheet into Google Sheets.
First go to the Google Drive homepage and log in using your credentials. In the Google Drive homepage, click New ➔ select Google Sheets ➔ click File ➔ Import ➔ in the Import file screen, click Upload ➔ then either drag the SampleMetadata Excel file to the box, or click Select a file from your device ➔ then open the file. In the Import file screen, the default option is “Replace spreadsheet”; just choose it and click Import data. SampleMetadata.xlsx is then uploaded into Google Sheets.
Step 3: Install the open source Google Sheets add-on Keemei.
QIIME 2 needs the sample metadata spreadsheet to be correctly formatted. You can open this file as a Google sheet and then use the Keemei (canonically pronounced key may) program (Rideout et al. 2016) for Google Sheets to check whether the file is correctly formatted. Keemei is an open source Google Sheets add-on for cloud-based validation of tabular bioinformatics file formats, including QIIME 2 metadata files. To install Keemei, first log in to your free Google account; you then have two options to install Keemei in Google Sheets. (1) Go to the https://keemei.qiime2.org/ webpage, click the Chrome Web Store, and then click INSTALL and follow the directions to install. (2) From within a Google Sheet: click and search for Keemei. Once Keemei is installed, you can use it to validate the SampleMetadata.xlsx file.
Step 4: Check whether the sample metadata spreadsheet is correctly formatted using Keemei.
To validate SampleMetadata.xlsx, click File ➔ Make a copy, and name it Copy of SampleMetadata. Now you can start to validate the sample metadata with Keemei. To validate the active sheet, click Add-ons ➔ Keemei ➔ Validate QIIME 2 metadata file. When the Keemei validation report says “Good news! Sheet metadata is a valid QIIME 2 metadata file,” your spreadsheet is correctly formatted for QIIME 2. In this case, the SampleMetadata spreadsheet passes Keemei validation. Now, click File again and choose Download as Tab-separated values (.tsv, current sheet). You can now save the file and rename it as you wish. In this case, we name it SampleMetadataMiSeq_SOP.tsv.
Once this spreadsheet is correctly formatted for QIIME 2, the file is ready for use. If cells are colored red, these cells have errors; if cells are colored yellow, these cells have warnings. A sidebar summarizes the validation report and lists the invalid cells. Locate the cells with errors and warnings, fix all the invalid cells, and revalidate until all cells are valid. To clear the validation status on the active sheet, click Add-ons ➔ Keemei ➔ Clear validation status; the cell background colors will reset to white and the notes will be cleared.
Step 5: Further inspect the sample metadata in QIIME 2.
To further inspect the sample metadata in QIIME 2, create a working directory (here, QIIME2R-Bioinformatics) and put the SampleMetadataMiSeq_SOP.tsv in this working directory.

source activate qiime2-2022.2

mkdir QIIME2R-Bioinformatics

cd QIIME2R-Bioinformatics

Then type the following commands in terminal (Fig. 4.1).

# Figure 4.1

qiime tools inspect-metadata SampleMetadataMiSeq_SOP.tsv

Fig. 4.1
Inspection of sample metadata for the mouse gut microbiome study

4.1.2 Prepare Raw Sequence Data

Different sequencing platforms (e.g., Illumina vs. Ion Torrent) or different sequencing approaches (e.g., single-end vs. paired-end) will provide us different structured raw data. In addition, any pre-processing steps such as joined paired ends and barcodes in fastq header performed by sequencing centers also will result in different structured raw data.

The paired-end raw sequence data downloaded in Example 4.1 are already demultiplexed and consist of both forward and reverse fastq.gz files for each sample. We first create a sub-directory called “MiSeq_SOP” within the directory QIIME2R-Bioinformatics and then store the sequence files there.

mkdir QIIME2R-Bioinformatics/MiSeq_SOP

Now the files are ready for analysis. Currently two approaches are available in QIIME 2 to construct a feature table from raw reads: using either the dada2 or the deblur plugin. We illustrate their use in turn below.

4.1.3 Import Data Files as Qiime Zipped Artifacts (.qza)

QIIME 2 works with artifacts (.qza). We must first import the FASTQ files as a QIIME artifact using the import command qiime tools import. As we described in Chap. 3, if the data are in neither EMP nor Casava format, they need to be manually imported into QIIME 2. First you need to create a tab-separated (i.e., .tsv) “manifest” text file. In this case, we created a manifest file called ManifestMiSeq_SOP.tsv in the same way as we created SampleMetadataMiSeq_SOP.tsv and stored it in the same working directory: QIIME2R-Bioinformatics.

One important point to emphasize: the first column in the manifest file defines the sample ID, while the second and third columns give the absolute file paths to the forward and reverse reads, respectively. For each sample, the sample-id (e.g., F3D0 in the first row) must match the sample name that appears in the sequence file names (e.g., F3D0 in the file names of the first row).

sample-id forward-absolute-filepath reverse-absolute-filepath

F3D0 $PWD/MiSeq_SOP/F3D0_S188_L001_R1_001.fastq.gz $PWD/MiSeq_SOP/F3D0_S188_L001_R2_001.fastq.gz

F3D1 $PWD/MiSeq_SOP/F3D1_S189_L001_R1_001.fastq.gz $PWD/MiSeq_SOP/F3D1_S189_L001_R2_001.fastq.gz

F3D2 $PWD/MiSeq_SOP/F3D2_S190_L001_R1_001.fastq.gz $PWD/MiSeq_SOP/F3D2_S190_L001_R2_001.fastq.gz

$PWD/MiSeq_SOP/ is the absolute file path that links the sample names to the sequence files for each sample. $PWD is a bash variable for the full path of the current working directory (here, QIIME2R-Bioinformatics). In this case, ManifestMiSeq_SOP.tsv was stored in the directory QIIME2R-Bioinformatics, and the raw sequences were stored in the sub-directory “MiSeq_SOP.” The absolute path $PWD/MiSeq_SOP/ links them. In the following commands, we pass ManifestMiSeq_SOP.tsv to the --input-path option.
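For studies with many samples, typing the manifest by hand is error-prone. The following Python sketch shows one possible way to generate the three columns described above from a directory of reads. It assumes, as in this example, that file names follow the pattern <sample>_..._R1_001.fastq.gz and <sample>_..._R2_001.fastq.gz and that the sample ID is the part before the first underscore. The function name build_manifest is hypothetical, and the sketch writes resolved absolute paths instead of relying on $PWD.

```python
import csv
from pathlib import Path

HEADER = ["sample-id", "forward-absolute-filepath", "reverse-absolute-filepath"]


def build_manifest(reads_dir, manifest_path):
    """Write a paired-end manifest TSV and return its data rows."""
    reads_dir = Path(reads_dir).resolve()  # make every path absolute
    rows = []
    for forward in sorted(reads_dir.glob("*_R1_001.fastq.gz")):
        # Assumption: the sample ID precedes the first underscore.
        sample_id = forward.name.split("_")[0]
        reverse = forward.with_name(
            forward.name.replace("_R1_001", "_R2_001"))
        if not reverse.exists():
            raise FileNotFoundError(f"missing reverse reads: {reverse}")
        rows.append([sample_id, str(forward), str(reverse)])
    with open(manifest_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(HEADER)
        writer.writerows(rows)
    return rows
```

Running build_manifest("MiSeq_SOP", "ManifestMiSeq_SOP.tsv") from the working directory would produce a file with the same columns as the excerpt shown above.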

source activate qiime2-2022.2

cd QIIME2R-Bioinformatics

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path ManifestMiSeq_SOP.tsv \
  --output-path PairedEndDemuxMiSeq_SOP.qza \
  --input-format PairedEndFastqManifestPhred33V2

QIIME 2 uses SampleData[PairedEndSequencesWithQuality] to indicate the sequence data for each sample are paired forward/reverse FASTQ files. So we specify the data type as 'SampleData[PairedEndSequencesWithQuality]'. We specify input data format as PairedEndFastqManifestPhred33V2 and name the output artifact as PairedEndDemuxMiSeq_SOP.qza. This file will contain a copy of each of the sequence data files, which will enhance research reproducibility.

You can check this qiime zipped artifact (.qza) using qiime tools peek command.
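Because a .qza file is a zip archive, its basic description can also be read without QIIME 2. The following Python sketch assumes that the archive contains a top-level directory named after the artifact's UUID with a metadata.yaml file holding simple "key: value" lines (uuid, type, format). This layout is stated here as an assumption; qiime tools peek remains the supported way to inspect an artifact. The function name peek_artifact is hypothetical.

```python
import zipfile


def peek_artifact(path):
    """Return the key/value pairs of an artifact's top-level metadata.yaml."""
    with zipfile.ZipFile(path) as archive:
        # Assumed layout: <uuid>/metadata.yaml directly under the root folder.
        name = next(n for n in archive.namelist()
                    if n.count("/") == 1 and n.endswith("/metadata.yaml"))
        text = archive.read(name).decode("utf-8")
    info = {}
    for line in text.splitlines():
        key, separator, value = line.partition(":")
        if separator:  # keep only "key: value" lines
            info[key.strip()] = value.strip()
    return info
```

For the artifact created above, peek_artifact("PairedEndDemuxMiSeq_SOP.qza") would be expected to report the type SampleData[PairedEndSequencesWithQuality].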

4.1.4 Examine and Visualize the Qualities of the Sequence Reads

To generate visualizations of the sequence qualities, you can run the command:

# Figures 4.2 and 4.3

qiime demux summarize \
  --p-n 10000 \
  --i-data PairedEndDemuxMiSeq_SOP.qza \
  --o-visualization PairedEndDemuxMiSeq_SOP.qzv

Alternatively, the following command can be used, because randomly sampling 10,000 sequences is the default in QIIME 2:

# Figures 4.2 and 4.3

qiime demux summarize \
  --i-data PairedEndDemuxMiSeq_SOP.qza \
  --o-visualization PairedEndDemuxMiSeq_SOP.qzv

To review the visualization in the PairedEndDemuxMiSeq_SOP.qzv file, use the QIIME 2 viewer in a browser: copy the PairedEndDemuxMiSeq_SOP.qzv output to your computer and open it at https://view.qiime2.org. The following plots are taken from the Interactive Quality Plot tab.

Figures 4.2 and 4.3 show the quality profile across a sample of 10,000 for forward and reverse reads, respectively.

Fig. 4.2
Quality score box plots sampled from 10,000 random forward reads

Fig. 4.3
Quality score box plots sampled from 10,000 random reverse reads

Illumina sequencing data generally show a trend of decreasing average quality towards the end of the sequencing reads. In this case, the forward and reverse reads display different quality patterns: the forward reads maintain high quality throughout, whereas the quality of the reverse reads drops significantly at around position 160. Based on this difference, we truncate the forward reads at position 240 and the reverse reads at position 160, as done by Callahan et al. (2016a).

We also notice that both quality plots have slightly lower quality scores at the beginning of each read. This is caused by the homogeneity of the primer sequences, which makes it difficult for the Illumina sequencer to properly identify clusters of DNA molecules (Fadrosh et al. 2014). Thus, the primer sequences must be removed at the denoising stage. Typically, the first 10 nucleotides of each read are trimmed, based on empirical observations across many Illumina datasets that these base positions are particularly likely to contain pathological errors (Callahan et al. 2016a).
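The reasoning behind choosing a truncation position can be sketched in a few lines of Python. The sketch decodes Phred+33 quality strings, averages the scores at each read position, and reports the first position whose mean falls below a threshold. It is a simplification: the interactive plot shows box plots rather than means, and the threshold and function names are hypothetical choices for illustration.

```python
def phred33_scores(quality_string):
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(character) - 33 for character in quality_string]


def mean_quality_by_position(quality_strings):
    """Average the quality scores at each position across reads."""
    totals, counts = [], []
    for quality in quality_strings:
        for position, score in enumerate(phred33_scores(quality)):
            if position == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]


def first_position_below(means, threshold):
    """Return the first 0-based position whose mean is below threshold."""
    for position, mean in enumerate(means):
        if mean < threshold:
            return position
    return len(means)  # quality never drops below the threshold
```

Applied to the reverse reads of this dataset with a suitable threshold, such a scan would be expected to point to the region around position 160 where the quality drops; the final choice should still be made by inspecting the plots.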

4.2 Analyzing Demultiplexed Paired-End FASTQ Data Using DADA2 and q2-dada2 Plugin

The 16S rRNA marker gene sequencing approach has several advantages (Xia et al. 2018), including the gene's unique structure, which contains both conserved and variable regions, and its presence in all known Bacteria and Archaea species. Compared to shotgun metagenomic sequencing, it also has the advantages of low cost and of avoiding problems with sequencing non-microbial DNA from host contamination. However, the approach is affected by sequencing errors, which make it difficult to distinguish biologically real nucleotide differences in 16S sequences from sequencing artifacts. Traditionally, the OTU method addresses this by clustering sequence reads into Operational Taxonomic Units (OTUs). This method is used by most of the available pipelines (Caporaso et al. 2010; Schloss 2020; Mysara et al. 2017; Kumar et al. 2011; Hildebrand et al. 2014). However, the OTU method has several fundamental weaknesses; for example, clustering sequences with a fixed 3% dissimilarity threshold may miss fine-scale variation among sequences (Rosen et al. 2012) and therefore often eliminates biological information present in the data. OTUs are not species; thus their construction is not necessitated by amplicon errors (Callahan et al. 2016a). In Chap. 6, we illustrate how to cluster OTUs via QIIME 2.

The methods for processing and analyzing 16S marker gene sequencing data continue to improve. The DADA2 (DADA: Divisive Amplicon Denoising Algorithm) (Callahan et al. 2016b) and Deblur (Amir et al. 2017) methods have been developed and are recognized as a major advance in quality control, denoising sequences to better discriminate between true sequence diversity and sequencing errors. Quality control of the sequences is typically performed prior to taxonomic classification. The goal is to identify the poor-quality reads and residual contamination in the dataset.

In this chapter, we illustrate how to perform sequence quality control, that is, denoising and QC filtering, to generate a feature table and feature data.

4.2.1 Introduction to DADA2 and q2-dada2 Plugin

With DADA2, “amplicon sequence variants” (ASVs) were proposed to replace OTUs as the standard unit of marker-gene analysis and reporting (Callahan et al. 2017). DADA2 was developed to model and correct Illumina-sequenced amplicon errors: it uses an error-modeling approach to denoise amplicons and outputs exact sequence variants (ASVs). DADA2 aims to overcome the fundamental weaknesses of traditional OTU methods and to improve on the performance of other approaches, including UPARSE, MED, mothur (average linkage), and QIIME (uclust) (Callahan et al. 2016a). DADA2 was demonstrated to be more accurate than these four methods (Callahan et al. 2016a), and it was shown (Callahan et al. 2016b) that DADA2 exactly infers sample sequences and resolves differences of as little as one nucleotide. The R package DADA2 implements the full amplicon workflow: filtering, dereplication, sample inference, chimera identification, and merging of paired-end reads (Callahan et al. 2016a). For details on the algorithm and the development of DADA2, the reader is referred to Sect. 8.3.2.4.

Here, we illustrate how to implement the DADA2 methods via the DADA2 plugin in QIIME 2. The DADA2 plugin has several methods to denoise reads, including (1) denoise-paired, which requires unmerged, paired-end reads (i.e., both forward and reverse); and (2) denoise-single, which accepts either single-end or unmerged paired-end data. When unmerged paired-end data are provided to denoise-single, only the forward reads are used and the reverse reads are ignored.

Implementing DADA2 and Deblur methods will generate two QIIME 2 artifacts: a FeatureTable[Frequency] and a FeatureData[Sequence]. The FeatureTable[Frequency] artifact contains counts (frequencies) of each unique sequence in each sample in the dataset, and the FeatureData[Sequence] artifact maps feature identifiers in the FeatureTable to the sequences they represent. In QIIME 1, they were called Biom table and rep_set fasta file, respectively.
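The relationship between the two artifacts can be pictured with plain Python dictionaries. The sketch below is not how QIIME 2 stores them (the feature table is a BIOM file); it only illustrates that the table holds per-sample counts keyed by feature identifier, while the feature data map each identifier to its sequence. Using an MD5 hash of the sequence as the identifier is an illustrative assumption, and the function name build_feature_artifacts is hypothetical.

```python
import hashlib
from collections import Counter


def build_feature_artifacts(denoised_reads):
    """Build a count table and an ID-to-sequence map.

    denoised_reads: dict mapping sample ID -> list of denoised sequences.
    """
    table = {}     # feature ID -> {sample ID: count}
    rep_seqs = {}  # feature ID -> sequence it represents
    for sample_id, sequences in denoised_reads.items():
        for sequence, count in Counter(sequences).items():
            # Assumption: identify each unique sequence by its MD5 hash.
            feature_id = hashlib.md5(sequence.encode("ascii")).hexdigest()
            rep_seqs[feature_id] = sequence
            table.setdefault(feature_id, {})[sample_id] = count
    return table, rep_seqs
```

With two samples that share one sequence, the shared sequence receives a single feature identifier whose row in the table records one count per sample.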

DADA2 and Deblur are currently the two denoising methods available in QIIME 2. We apply the DADA2 approach to the mouse gut microbiome data (Example 4.1: MiSeq_SOP).

4.2.2 Denoise Sequences to Construct Feature Table and Feature Data with q2-dada2 Plugin

A feature is a species or an OTU in the context of microbiome sequencing, or a gene in the RNA-Seq context. Specifically, in DADA2 and Deblur, features are ASVs or sub-OTUs, respectively. In QIIME 2, denoising Illumina sequences via DADA2 is an alternative to OTU clustering at a defined sequence-identity cutoff (e.g., 97%). The qiime dada2 denoise-paired method denoises paired-end sequences, dereplicates them, merges the paired reads, and filters chimeras.

The dada2 denoise-paired method takes paired-end demultiplexed sequences (a SampleData[PairedEndSequencesWithQuality] artifact) as input data, together with four parameters that are used in quality filtering: --p-trim-left-f, --p-trim-left-r, --p-trunc-len-f, and --p-trunc-len-r.

The parameters --p-trim-left-f and --p-trim-left-r (optional) are used to trim the 5′ end of the input sequences, which will be the bases that were sequenced in the first cycles. When primers are present in the input sequence files, they must be removed before denoising, because degeneracy in the primers can otherwise lead to false-positive detection of chimeras; DADA2 removes the specified number of bases, which should cover the length of the primer sequences. The parameter --p-trim-left-f specifies an integer position at which forward read sequences are trimmed due to low quality. The default is 0 (no trimming of the forward read sequences). The parameter --p-trim-left-r specifies an integer position at which reverse read sequences are trimmed due to low quality. The default is 0.
The parameter --p-trunc-len-f indicates the position at which the forward read will be truncated, and the parameter --p-trunc-len-r indicates the position at which the reverse read will be truncated. They are used to truncate the 3′ end of the input sequences, which will be the bases that were sequenced in the last cycles. The Interactive Quality Plot tab in the visualization of the PairedEndDemuxMiSeq_SOP.qzv file, which was generated by the qiime demux summarize command, shows the quality scores at each position of the reads. To determine what values to pass for these two parameters (--p-trunc-len-f and --p-trunc-len-r), you should review the Interactive Quality Plot tab. If 0 is specified, no truncation or length filtering is performed.
The parameter --p-max-ee (--p-max-ee-f and --p-max-ee-r for forward reads and reverse reads, respectively) (optional) controls the maximum number of expected errors in a sequence before it is discarded (default is 2). The default of 2 enforces a maximum of 2 expected errors per read (Edgar and Flyvbjerg 2015), which combines the trimming parameter with standard filtering parameters and is considered a better filter than simply averaging quality scores (Edgar and Flyvbjerg 2015). DADA2 trims and filters paired reads jointly, i.e., both reads must pass the filter for the pair to pass.
The parameter --p-trunc-q (optional) is used to truncate the sequence at the first position that has a quality score equal to or less than the provided value (default is 2).
The parameter --p-pooling-method (optional) specifies how samples are pooled for denoising. By default, samples are denoised independently (“independent”). If it is set to “pseudo”, the pseudo-pooling method is used to approximate pooling of samples.
The parameter --p-chimera-method (optional) specifies the method (“none,” “pooled,” or “consensus”) used to remove chimeras. With “none,” no chimeras are removed; with “pooled,” all reads are pooled prior to chimera detection; with the default (“consensus”), chimeras are detected in samples individually, and sequences are removed if they are found to be chimeric in a sufficient fraction of samples.
The parameter --p-n-threads (optional) specifies the number of threads to use for multithreaded processing, with 1 as the default and 0 to use all available cores.
The parameter --p-n-reads-learn (optional) specifies the number of reads to use when training the error model. By default, 1,000,000 reads are used; smaller numbers give a shorter run time but a less reliable error model.
The reader can refer to the QIIME 2 documentation for other input parameters.
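To see how trimming, truncation, and the expected-error filter interact, consider the following Python sketch. The expected number of errors in a read is the sum of the per-base error probabilities, 10^(-Q/10), over its quality scores. The sketch assumes Phred+33 encoding, discards reads shorter than the truncation length, and leaves out the --p-trunc-q step; it is an illustration of the idea and not the DADA2 implementation. The function names are hypothetical.

```python
def expected_errors(quality_string):
    """Sum of per-base error probabilities for a Phred+33 quality string."""
    return sum(10 ** (-(ord(ch) - 33) / 10) for ch in quality_string)


def trim_truncate_filter(sequence, quality, trim_left, trunc_len, max_ee=2.0):
    """Return the processed (sequence, quality) or None if the read fails.

    trim_left removes bases from the 5' end; trunc_len is the position at
    which the 3' end is cut (0 means no truncation or length filtering).
    """
    if trunc_len and len(sequence) < trunc_len:
        return None  # read is too short to be truncated at trunc_len
    end = trunc_len if trunc_len else len(sequence)
    kept_sequence = sequence[trim_left:end]
    kept_quality = quality[trim_left:end]
    if expected_errors(kept_quality) > max_ee:
        return None  # too many expected errors
    return kept_sequence, kept_quality
```

With a trim of 10 and a truncation length of 240, a surviving forward read has 230 bases, which agrees with the per-read base counts in the DADA2 log shown later in this section (232,305,520 bases in 1,010,024 reads). For paired-end data, a pair would be kept only when both the forward and the reverse read pass.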

We implement the following commands to denoise the sequences with DADA2 based on quality score visualizations.

cd QIIME2R-Bioinformatics

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs PairedEndDemuxMiSeq_SOP.qza \
  --p-trim-left-f 10 \
  --p-trim-left-r 10 \
  --p-trunc-len-f 240 \
  --p-trunc-len-r 160 \
  --p-n-threads 4 \
  --o-table FeatureTableMiSeq_SOP.qza \
  --o-representative-sequences RepSeqsMiSeq_SOP.qza \
  --o-denoising-stats DenoisingStatsMiSeq_SOP.qza \
  --verbose

Running external command line application(s). This may print messages to stdout and/or stderr.

The command(s) being run are below. These commands cannot be manually re-run as they will depend on temporary files that no longer exist.

Command: run_dada_paired.R /var/folders/80/b4jv62j553b9g3s7l5vxbcg40000gn/T/tmpxqoc0vjc/forward /var/folders/80/b4jv62j553b9g3s7l5vxbcg40000gn/T/tmpxqoc0vjc/reverse /var/folders/80/b4jv62j553b9g3s7l5vxbcg40000gn/T/tmpxqoc0vjc/output.tsv.biom /var/folders/80/b4jv62j553b9g3s7l5vxbcg40000gn/T/tmpxqoc0vjc/track.tsv /var/folders/80/b4jv62j553b9g3s7l5vxbcg40000gn/T/tmpxqoc0vjc/filt_f /var/folders/80/b4jv62j553b9g3s7l5vxbcg40000gn/T/tmpxqoc0vjc/filt_r 240 160 10 10 2.0 2.0 2 12 independent consensus 1.0 4 1000000

R version 4.1.2 (2021-11-01)

Loading required package: Rcpp

DADA2: 1.22.0 / Rcpp: 1.0.8 / RcppParallel: 5.1.5

1) Filtering

........................................................................................................................................................................................................................................................................................................................................................................

2) Learning Error Rates

232305520 total bases in 1010024 reads from 94 samples will be used for learning the error rates.

151503600 total bases in 1010024 reads from 94 samples will be used for learning the error rates.

3) Denoise samples ........................................................................................................................................................................................................................................................................................................................................................................

4) Remove chimeras (method = consensus)

6) Write output

In the above commands, we use the parameter --p-n-threads 4 to allow the program to perform parallel computations on 4 threads. If your datasets are very large, DADA2 may be slow, and you may need to increase the number of threads. The option --verbose is used to display the DADA2 progress in the terminal, as shown above. The information printed in the terminal with the --verbose option shows 6 stages of the denoising process.

The denoising process outputs three artifacts: (1) a FeatureTable[Frequency] file via --o-table (we named it FeatureTableMiSeq_SOP.qza), (2) a FeatureData[Sequence] file via --o-representative-sequences, which is the representative sequence file (we named it RepSeqsMiSeq_SOP.qza), and (3) a DADA2 stats artifact via --o-denoising-stats (we named it DenoisingStatsMiSeq_SOP.qza). All three output file names are required. The feature table file is in the Biological Observation Matrix (BIOM) format. The representative sequence file contains the denoised sequences, while the table file maps each of the sequences onto their denoised parent sequence.

The feature table produced by the DADA2 method is a higher-resolution analogue of the common “OTU table”; however, its features are amplicon sequence variants (ASVs) instead of OTUs, which are thought to resolve variants that differ by as little as one nucleotide (Callahan et al. 2016b).

4.2.3 Summarize the Feature Table and Feature Data from q2-dada2 Plugin

After successfully denoising the sequences for each sample and generating the feature table and representative sequences, we can summarize the denoised data using the qiime feature-table commands.

qiime feature-table summarize \
  --i-table FeatureTableMiSeq_SOP.qza \
  --o-visualization FeatureTableMiSeq_SOP.qzv \
  --m-sample-metadata-file SampleMetadataMiSeq_SOP.tsv

qiime feature-table tabulate-seqs \
  --i-data RepSeqsMiSeq_SOP.qza \
  --o-visualization RepSeqsMiSeq_SOP.qzv

The two visualization files (.qzv) produced by the above commands can be explored via the QIIME 2 viewer. The “Interactive Sample Detail” tab provides detailed information about the denoised sequence counts, such as the number of sequences per sample. We can use it to explore how different rarefaction (subsampling) depths will affect the data. For example, we may check which samples have the lowest sequencing depths and would therefore be dropped.
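The effect of a candidate sampling depth can be worked out with simple arithmetic: samples whose total count is below the depth are dropped, and each retained sample contributes exactly that many sequences. The following Python sketch assumes a feature table held as nested dictionaries (sample ID to feature counts); the function name samples_retained and the toy counts in the usage note are hypothetical.

```python
def samples_retained(table, depth):
    """Report which samples survive subsampling to a given depth.

    table: dict mapping sample ID -> dict of feature ID -> count.
    Returns (kept sample IDs, dropped sample IDs, sequences retained).
    """
    totals = {sample: sum(counts.values()) for sample, counts in table.items()}
    kept = sorted(s for s, total in totals.items() if total >= depth)
    dropped = sorted(s for s in totals if s not in kept)
    retained_sequences = depth * len(kept)  # each kept sample keeps `depth`
    return kept, dropped, retained_sequences
```

For a toy table with per-sample totals of 150, 20, and 150 sequences, a depth of 100 keeps two samples, drops the one with 20 sequences, and retains 200 sequences in total.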

4.3 Analyzing Multiplexed Paired-End FASTQ Data Using q2-dada2 Plugin

To illustrate the bioinformatic workflow of QIIME 2 for multiplexed paired-end FASTQ data, in this section we demonstrate the steps of analyzing multiplexed paired-end FASTQ data using the Atacama soil microbiome data.

Example 4.2: Atacama Soil Microbiome

The data used here are originally from the study “Significant Impacts of Increasing Aridity on the Arid Soil Microbiome” (Neilson et al. 2017), which analyzes soil samples from the Atacama Desert in northern Chile. This desert is one of the most arid locations on Earth, where some areas receive less than a millimeter of rain per decade. We downloaded the data from the QIIME 2 website and use them to illustrate how to denoise sequences to construct a feature table and the associated feature sequences using DADA2, along with importing, demultiplexing, and some other preliminary steps.

4.3.1 Prepare Sample Metadata

To store the data, we first create a subdirectory within the QIIME2R-Bioinformatics directory.

cd QIIME2R-Bioinformatics

mkdir Atacama

cd Atacama

The sample metadata is available as a Google Sheet from the QIIME 2 website. We download it and save it as SampleMetadataAtacama.tsv in the directory Atacama. Because the sample metadata is provided by QIIME 2, it has already passed Keemei validation and is ready for use. We can further inspect the sample metadata in QIIME 2 by typing the following commands in the terminal (Fig. 4.1).

# Figure 4.1

qiime tools inspect-metadata SampleMetadataAtacama.tsv

We can also create a visualization of this sample metadata (a qiime zipped visualization (.qzv file)) and navigate to QIIME2 viewer in browser to view this visualization.

qiime metadata tabulate \
  --m-input-file SampleMetadataAtacama.tsv \
  --o-visualization TabulatedSampleMetadataAtacama.qzv

4.3.2 Prepare Raw Sequence Data

To store the raw sequence data, we create a sub-directory within the directory Atacama.

mkdir EmpPairedEndSequences

The paired-end raw sequence data consist of three fastq format files: forward.fastq.gz, reverse.fastq.gz, and barcodes.fastq.gz. They represent forward reads, reverse reads, and barcodes in sequencing run, respectively. Here, we use a 10% subsample data downloaded from QIIME 2 website. We move these three fastq files into the EmpPairedEndSequences working directory we just created and now the files are ready for use. The FASTQ data have the specific format of EMP (EMPPairedEndSequences).

Sequences in the EMPPairedEndSequences format in QIIME 2 artifacts are multiplexed, meaning that the sequences have not yet been assigned to samples. Hence, to process this kind of sequence data, both the sequence files and the barcodes.fastq.gz file are needed; barcodes.fastq.gz contains the barcode read associated with each sequence in the sequence files.

4.3.3 Import Data Files as Qiime Zipped Artifacts (.qza)

The data format used here is called EMPPairedEndSequences. We import the sequences with the qiime tools import command, which creates an artifact of the data.

qiime tools import \

--type EMPPairedEndSequences \

--input-path EmpPairedEndSequences \

--output-path EmpPairedEndSequencesAtacama.qza

We can check this qiime zipped artifact (.qza) using the qiime tools peek command (for example, qiime tools peek EmpPairedEndSequencesAtacama.qza).

4.3.4 Demultiplexing Sequences

Next-generation sequencing instruments can analyze multiple samples in a single lane/run by multiplexing these samples. Typically, a unique barcode (a.k.a. index or tag) sequence is appended to one or both ends of each sequence to identify the sample it originated from. Detecting these barcode sequences and mapping them back to the samples they belong to is called demultiplexing.

In QIIME 2, two plugins are available for demultiplexing sequences: q2-demux and q2-cutadapt. However, depending on the type of raw sequence data (EMP Single End, EMP Paired End, Multiplexed Single End Barcode, or Multiplexed Paired End Barcode), usually only one demultiplexing action in q2-demux or q2-cutadapt applies to the data. If the barcodes have already been removed from the reads and are in a separate file, q2-demux can be used; if the barcodes are still in the sequences, q2-cutadapt can be used instead.

Because demultiplexing detects the barcode sequences and maps them back to the samples they belong to, the sample metadata file is required. To demultiplex, we must specify which column in the sample metadata file contains the per-sample barcodes. As shown in Fig. 4.4, in this case, that column name is barcode-sequence. Additionally, in this dataset, the barcode reads are the reverse complement of those included in the sample metadata file, so we also need to include the --p-rev-comp-mapping-barcodes parameter.

qiime demux emp-paired \

--m-barcodes-file SampleMetadataAtacama.tsv \

--m-barcodes-column barcode-sequence \

--p-rev-comp-mapping-barcodes \

--i-seqs EmpPairedEndSequencesAtacama.qza \

--o-per-sample-sequences DemuxAtacama.qza \

--o-error-correction-details DemuxDetailsAtacama.qza
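To make the role of --p-rev-comp-mapping-barcodes more concrete, the following minimal Python sketch assigns reads to samples by exact barcode match after reverse-complementing the metadata barcodes. It is an illustration under simplifying assumptions, not the q2-demux implementation: the sample IDs, barcodes, and function names are hypothetical toy values, and barcode error correction is omitted.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def demultiplex(barcode_reads, sample_barcodes, rev_comp_mapping_barcodes=False):
    """Assign read indices to samples by exact barcode match.

    barcode_reads: list of barcode reads, one per sequence, in read order.
    sample_barcodes: dict mapping sample ID -> barcode from the metadata column.
    Returns (dict sample ID -> list of read indices, list of unassigned indices).
    """
    # Build the lookup from the barcode as it appears in the reads to the sample.
    lookup = {}
    for sample_id, barcode in sample_barcodes.items():
        key = reverse_complement(barcode) if rev_comp_mapping_barcodes else barcode
        lookup[key] = sample_id

    per_sample = {sample_id: [] for sample_id in sample_barcodes}
    unassigned = []
    for index, barcode_read in enumerate(barcode_reads):
        sample_id = lookup.get(barcode_read)
        if sample_id is None:
            unassigned.append(index)
        else:
            per_sample[sample_id].append(index)
    return per_sample, unassigned


# Toy data (hypothetical): the metadata barcodes are the reverse complement
# of the barcodes that appear in the barcode reads.
sample_barcodes = {"sampleA": "AACC", "sampleB": "GATT"}
barcode_reads = ["GGTT", "AATC", "GGTT", "CCCC"]

per_sample, unassigned = demultiplex(
    barcode_reads, sample_barcodes, rev_comp_mapping_barcodes=True
)
print(per_sample)
print(unassigned)
```

Without the reverse-complement option, none of the toy barcode reads would match a metadata barcode, which mirrors why the parameter is needed for this dataset.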

4.3.5 Summarize the Demultiplexing Results and Examine Quality of the Reads

After demultiplexing, we can use the following commands to create a visualization.

qiime demux summarize \

--i-data DemuxAtacama.qza \

--o-visualization DemuxAtacama.qzv

Fig. 4.4
Inspection of sample metadata for the Atacama soil microbiome study

As with the visualization of the sample metadata, to view this visualization we can navigate to the QIIME2 viewer in a browser, or copy the .qzv output to a local computer and open DemuxAtacama.qzv in www.view.qiime2.org. We can view how many sequences were obtained per sample. The “Overview” tab shows the demultiplexed sequence counts summary (minimum, median, mean, maximum, total) and the per-sample sequence counts (total samples, sample name, and sequence count). The demultiplexed sequence counts summary is displayed as a frequency plot, which is downloadable as a .pdf file. The per-sample sequence counts can be downloaded as a .csv file. In the “Interactive Quality Plot,” we can hover over a specific position to check how many reads are at least that long. These are the reads that were sampled for computing sequence quality. We can click at any position on the plots of Forward Reads and Reverse Reads to check the quality scores for that position. For example, for Forward Reads, the 50th percentile (median) quality score at position 100 is 38. We can download both the forward and reverse parametric seven-number summaries (2nd, 9th, 25th, 50th (median), 75th, 91st, and 98th percentiles) as .csv files. The “Interactive Quality Plot” also includes a demultiplexed sequence length summary (Figs. 4.5 and 4.6).

Fig. 4.5
Quality score box plots sampled from 10,000 random forward reads for Atacama soil study

Fig. 4.6
Quality score box plots sampled from 10,000 random reverse reads for Atacama soil study
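The per-position numbers shown in the interactive quality plot can be thought of as follows. The minimal Python sketch below counts how many reads reach a given position and computes the median quality score there. It assumes Phred+33 quality encoding and uses made-up quality strings; the function names are hypothetical, and this is not how q2-demux itself computes the summary.

```python
from statistics import median


def phred33_scores(quality_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality_string]


def position_summary(quality_strings, position):
    """Summarize quality at a 1-based position across reads.

    Returns (number of reads at least `position` long, median quality score),
    or (0, None) if no read reaches that position.
    """
    scores = [
        phred33_scores(quality)[position - 1]
        for quality in quality_strings
        if len(quality) >= position
    ]
    if not scores:
        return 0, None
    return len(scores), median(scores)


# Toy quality strings (hypothetical): 'I' = Q40, 'G' = Q38, '5' = Q20, '#' = Q2.
quality_strings = ["IIGG5", "IGG", "IIGG#", "GG"]

reads_at_3, median_at_3 = position_summary(quality_strings, 3)
reads_at_5, median_at_5 = position_summary(quality_strings, 5)
print(reads_at_3, median_at_3)
print(reads_at_5, median_at_5)
```

Positions near the end of the reads are supported by fewer reads, which is why the number of reads behind each box plot is worth checking before choosing a truncation length.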

4.4 Analyzing Demultiplexed Paired-End FASTQ Data Using Deblur and q2-deblur Plugin

In the mouse gut microbiome example (Example 4.1), we showed how to denoise and quality-filter sequences to generate a feature table and feature data using the DADA2 method. Here, we illustrate Deblur, another denoising method currently available in QIIME 2.

4.4.1 Introduction to Deblur and q2-deblur Plugin

Deblur (Amir et al. 2017) takes a sub-operational-taxonomic-unit (sub-OTU or sOTU) approach that aims to identify exact sequences, i.e., to obtain putative error-free sequences at single-nucleotide resolution in amplicon studies, such as those from the Illumina MiSeq and HiSeq sequencing platforms. To obtain single-nucleotide resolution, Deblur employs a sample-by-sample approach combined with a greedy algorithm, which compares sequence-to-sequence Hamming distances within a sample to an upper-bound error profile (Amir et al. 2017). Using an upper bound on the error rate, a constant probability of indels, and the mean read error rate together, the algorithm removes predicted error-derived reads from neighboring sequences when the sequences are aligned together into “sub-OTUs” (Amir et al. 2017). Like ASVs, sub-OTUs are considered to represent the true biological sequences present in the data. The benefit of the sample-by-sample approach is that it reduces both memory requirements and computational demand (Amir et al. 2017; Nearing et al. 2018). For details on the algorithm and the development of Deblur, the reader is referred to Sect. 8.3.2.6.
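The Hamming distance mentioned above is simply the number of positions at which two equal-length sequences differ. The following minimal Python sketch computes it and groups toy sequences by their distance to an abundant sequence. It illustrates only the distance measure, not Deblur's error model or greedy algorithm; the sequences and function names are hypothetical.

```python
def hamming_distance(seq_a, seq_b):
    """Count positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(base_a != base_b for base_a, base_b in zip(seq_a, seq_b))


def neighbors_by_distance(center, candidates):
    """Group candidate sequences by their Hamming distance to `center`."""
    groups = {}
    for seq in candidates:
        groups.setdefault(hamming_distance(center, seq), []).append(seq)
    return groups


# Toy sequences (hypothetical), all of the same length.
abundant = "ACGTACGT"
others = ["ACGTACGA", "ACGAACGA", "ACGTACGT", "TCGTACGA"]

groups = neighbors_by_distance(abundant, others)
print(groups)
```

Because the Hamming distance is defined only for sequences of equal length, comparing reads this way presupposes that they have been brought to a common length.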

4.4.2 Process Initial Quality Filtering

To obtain high-quality sequence variant data, Deblur uses sequence error profiles to relate erroneous sequence reads to the true biological sequences from which they derive. To achieve this goal, an initial quality filtering based on quality scores should be applied, as recommended by Bokulich et al. (2013). In QIIME 2, this quality-filtering method is implemented in the q2-quality-filter plugin and is run before q2-deblur.

Below we run an initial quality filter with default settings prior to using Deblur. Like the DADA2 method, Deblur can truncate reads to a constant length prior to denoising, via the parameter --p-trim-length. Truncation itself is optional: specifying --p-trim-length -1 disables it.

cd QIIME2R-Bioinformatics/Atacama

qiime quality-filter q-score \

--i-demux DemuxAtacama.qza \

--o-filtered-sequences DemuxFilteredAtacama.qza \

--o-filter-stats DemuxFilterStatsAtacama.qza

4.4.3 Preliminary Works for Denoising with Deblur

The above artifact DemuxAtacama.qza is the demultiplexed version of EmpPairedEndSequencesAtacama.qza obtained in Sect. 4.3.4. It is a SampleData[PairedEndSequencesWithQuality] artifact. After running qiime quality-filter q-score, the filtered result was saved as DemuxFilteredAtacama.qza in the QIIME2R-Bioinformatics/Atacama directory. Two methods are available in the deblur plugin to denoise sequences: (1) denoise-16S for denoising 16S sequences and (2) denoise-other for denoising other types of sequences. Some preliminary work needs to be done before denoising with Deblur.

Step 1: Join reads.
Deblur needs the paired-end sequences to be joined before it denoises them. Because the paired-end sequences have not yet been joined, we join them in QIIME 2 using the q2-vsearch plugin.

qiime vsearch join-pairs \

--i-demultiplexed-seqs DemuxAtacama.qza \

--o-joined-sequences DemuxJoinedAtacama.qza
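Conceptually, joining a read pair means reverse-complementing the reverse read and merging it with the forward read where the two overlap. The minimal Python sketch below does this for exact overlaps only. It is a simplified illustration with hypothetical names and toy reads; vsearch itself uses alignment and quality scores and tolerates mismatches, none of which is modeled here.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq))


def join_pair(forward, reverse, min_overlap=4):
    """Join a read pair using the longest exact overlap.

    The reverse read is reverse-complemented, then the end of the forward
    read is matched against its start. Returns the joined sequence, or None
    if no exact overlap of at least `min_overlap` bases exists.
    """
    reverse_rc = reverse_complement(reverse)
    longest_possible = min(len(forward), len(reverse_rc))
    for overlap in range(longest_possible, min_overlap - 1, -1):
        if forward[-overlap:] == reverse_rc[:overlap]:
            return forward + reverse_rc[overlap:]
    return None


# Toy pair (hypothetical) drawn from the 14-base fragment ACGTTGCAAGGCTA:
# the forward read covers its first 9 bases, and the reverse read is the
# reverse complement of its last 9 bases.
forward_read = "ACGTTGCAA"
reverse_read = "TAGCCTTGC"

joined = join_pair(forward_read, reverse_read)
print(joined)
```

Pairs whose reads do not overlap cannot be joined and are dropped in this sketch, which is one reason the number of joined reads can be lower than the number of input pairs.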

Step 2: Generate a summary of the joined data (in this case, DemuxJoinedAtacama.qza).

qiime demux summarize \

--i-data DemuxJoinedAtacama.qza \

--o-visualization DemuxJoinedAtacama.qzv

As with other demultiplexed sequences, it is crucial to view the summary of the joined data, including read quality, via the QIIME2 viewer. This summary provides several particularly useful pieces of information, including how long the joined reads are and how many reads were used to estimate the quality score distribution at a specific position. The quality score plots show that most sequences are at least 250 bases long, which tells us what length value to specify when trimming the sequences (Fig. 4.7).

Fig. 4.7
Quality score box plots sampled from 10,000 random reads for Atacama soil study

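Step 3: Conduct sequence quality control on the joined sequences using quality-filter q-score.
Some QIIME 2 releases provide a separate quality-filter q-score-joined method, which is identical to quality-filter q-score except that it operates on joined reads; the command below applies quality-filter q-score to the joined artifact.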

qiime quality-filter q-score \

--i-demux DemuxJoinedAtacama.qza \

--o-filtered-sequences DemuxJoinedFilteredAtacama.qza \

--o-filter-stats DemuxJoinedFilterStatsAtacama.qza

4.4.4 Denoise Sequences with Deblur to Construct Feature Table and Feature Data

The two denoising methods in the deblur plugin, denoise-16S and denoise-other, are used in different ways. When using denoise-16S, an initial positive filtering step discards reads that have less than 60% identity to sequences from the 88% OTU Greengenes database. For other data, Deblur recommends the denoise-other method. For example, when you apply Deblur to 18S data, you need to specify a reference composed of 18S sequences so that you can filter out sequences that do not appear to be 18S.

The qiime deblur denoise-16S method performs sequence quality control for Illumina data using a 16S reference as a positive filter. Currently QIIME 2 supports only forward reads and uses the 88% OTUs from Greengenes 13_8 as the specific reference. Thus, this method is limited to 16S amplicon protocols on an Illumina platform.

Deblur can now denoise the sequences, which provides additional quality control and, as with DADA2, much higher-quality results. The crucial action here is to specify a sequence length value for the --p-trim-length parameter based on reviewing the quality score plots (250 in this case). All sequences will be trimmed to this length, and any sequences that are not at least this long will be discarded.
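The effect of the trim length described above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical names and toy reads, not code from Deblur: sequences are cut to a fixed length, shorter ones are dropped, and a value of -1 leaves everything untouched.

```python
def trim_to_length(sequences, trim_length):
    """Trim sequences to a fixed length, dropping those that are too short.

    A trim_length of -1 disables trimming and returns all sequences unchanged.
    """
    if trim_length == -1:
        return list(sequences)
    return [seq[:trim_length] for seq in sequences if len(seq) >= trim_length]


# Toy reads (hypothetical) of lengths 10, 6, and 8.
reads = ["ACGTACGTAC", "ACGTAC", "ACGTACGT"]

trimmed = trim_to_length(reads, 8)
untrimmed = trim_to_length(reads, -1)
print(trimmed)
print(untrimmed)
```

Choosing a larger trim length keeps more bases per read but discards more reads, so the value is a trade-off that the quality score plots help to settle.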

Here we use the denoise-16S method. In the following command, the required input (the demultiplexed sequences to be denoised) is an artifact of one of three types: (1) SampleData[SequencesWithQuality], (2) SampleData[PairedEndSequencesWithQuality], or (3) SampleData[JoinedSequencesWithQuality] (here, DemuxJoinedFilteredAtacama.qza). The parameter --p-trim-length, which specifies the sequence trim length, is also required; specifying -1 disables trimming.

The parameters --p-sample-stats and --p-no-sample-stats specify whether or not to gather per-sample stats. The default is false (stats are not gathered). Here we want to gather the stats.

Three outputs are required: --o-table, an artifact of FeatureTable[Frequency], which is the resulting denoised feature table; --o-representative-sequences, an artifact of FeatureData[Sequence], which is the resulting feature sequences; and --o-stats, an artifact of per-sample stats.

qiime deblur denoise-16S \

--i-demultiplexed-seqs DemuxJoinedFilteredAtacama.qza \

--p-trim-length 250 \

--p-sample-stats \

--o-table FeatureTableAtacama.qza \

--o-representative-sequences RepSeqsAtacama.qza \

--o-stats StatsAtacama.qza

In the above command, we did not specify the other parameters and instead used their defaults:
(1) --p-left-trim-len, with range(0, None), trims the sequence from the 5′ end. The default value of 0 disables this trim.
(2) --p-mean-error specifies the mean per-nucleotide error for the original sequence estimate. The default value is 0.005.
(3) --p-indel-prob specifies the insertion/deletion (indel) probability (same for N indels). The default value is 0.01.
(4) --p-indel-max specifies the maximum number of insertions/deletions. The default value is 3.
(5) --p-min-reads retains only features appearing at least min-reads times across all samples in the resulting feature table. The default value is 10.
(6) --p-min-size discards all features with an abundance less than min-size in each sample. The default value is 2.
(7) --p-jobs-to-start specifies the number of jobs to start (if running in parallel). The default value is 1.
(8) --p-hashed-feature-ids / --p-no-hashed-feature-ids specifies whether or not to hash the feature IDs. The default is true.
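The two abundance thresholds, min-size (per sample) and min-reads (across all samples), can be illustrated with a minimal Python sketch. The order of the two filters, the table layout, and all names and counts here are assumptions made for illustration; this is not Deblur's implementation.

```python
def filter_features(table, min_size=2, min_reads=10):
    """Apply per-sample and across-sample abundance filters.

    table: dict mapping sample ID -> dict mapping feature ID -> read count.
    Returns a new table with the same layout.
    """
    # Per-sample filter: drop features whose count in that sample is below min_size.
    per_sample = {
        sample_id: {
            feature_id: count
            for feature_id, count in counts.items()
            if count >= min_size
        }
        for sample_id, counts in table.items()
    }

    # Across-sample filter: keep features whose total count reaches min_reads.
    totals = {}
    for counts in per_sample.values():
        for feature_id, count in counts.items():
            totals[feature_id] = totals.get(feature_id, 0) + count
    kept_features = {
        feature_id for feature_id, total in totals.items() if total >= min_reads
    }

    return {
        sample_id: {
            feature_id: count
            for feature_id, count in counts.items()
            if feature_id in kept_features
        }
        for sample_id, counts in per_sample.items()
    }


# Toy feature table (hypothetical counts).
table = {
    "s1": {"featA": 8, "featB": 1, "featC": 3},
    "s2": {"featA": 5, "featB": 1, "featC": 4},
}

filtered = filter_features(table)
print(filtered)
```

In this toy table, featB is removed by the per-sample threshold and featC by the across-sample threshold, leaving only featA.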

4.4.5 Summarize the Feature Table and Feature Data from Deblur

As with the DADA2 approach, we can now summarize the feature table and feature data and review them via the QIIME2 viewer.

qiime feature-table summarize \

--i-table FeatureTableAtacama.qza \

--o-visualization FeatureTableAtacama.qzv \

--m-sample-metadata-file SampleMetadataAtacama.tsv

qiime feature-table tabulate-seqs \

--i-data RepSeqsAtacama.qza \

--o-visualization RepSeqsAtacama.qzv

qiime metadata tabulate \

--m-input-file DemuxJoinedFilterStatsAtacama.qza \

--o-visualization DemuxJoinedFilterStatsAtacama.qzv

4.4.6 Remarks on DADA2 and Deblur

DADA2 and Deblur share some common strengths over traditional denoising and QC filtering methods, and they also differ in several respects:

Both DADA2 and Deblur were proposed as bioinformatic sequence denoising approaches that correct sequencing errors to improve taxonomic resolution. Similar to the traditional OTU method, both pipelines are self-contained, processing 16S rRNA gene sequencing data starting from raw sequences (i.e., FASTQ). However, both methods have advanced the field from traditionally binning sequences into 97% OTUs to effectively using the sequences themselves as the unique identifier for a taxon (also referred to as 100% OTUs).
Both methods perform quality filtering, denoising, and chimera removal, require little data preparation, and do not need any quality screening prior to running them.
Both DADA2 and Deblur have their own advantages and disadvantages. They also share some common properties, such as obtaining single-nucleotide resolution. However, we do not intend to compare the DADA2 and Deblur methods here.
Deblur runs its denoising process sample-by-sample, which helps lower Deblur’s computational requirements; however, it also reduces its ability to correct multi-run.
Another major difference between Deblur and DADA2 is that the built-in Deblur function in QIIME 2 uses a positive filter. The two methods also use different strategies to handle errors: DADA2 corrects (changes) “erroneous” sequences to match the sequence from which they are inferred to have arisen, whereas Deblur removes those error sequences. With the default setting, Deblur discards reads that do not reach the identity threshold to any sequence in the 88% representative-sequence Greengenes database; as a result, the total numbers of sequences and features remaining after denoising with Deblur are smaller than those after denoising with DADA2.
Deblur is able to handle processed data, whereas DADA2 can only handle raw data. This difference highlights the importance of Deblur, because many publicly available datasets are already processed.
Deblur was found to perform worse than DADA2 and open-reference OTU clustering in detecting low-abundance taxa in the extreme dataset at 97% identity (Nearing et al. 2018).
However, our main goal here is to illustrate these two methods via QIIME 2 and to introduce their theories as necessary to better understand their approaches.

4.5 Summary

In this chapter, we described two methods of building a feature table and feature representative sequences from raw reads, DADA2 and Deblur, and illustrated their use in QIIME 2 plugins with two real 16S rRNA microbiome datasets. First, we introduced some general procedures to analyze demultiplexed paired-end FASTQ data, including preparation of sample metadata and raw sequence data, importation of data files as Qiime Zipped Artifacts (.qza), and examination and visualization of the quality of the sequence reads. Second, we introduced DADA2 and the q2-dada2 plugin and illustrated how to use them to analyze demultiplexed paired-end FASTQ data, specifically to denoise sequences to construct a feature table and feature data, and how to summarize the resulting feature table and feature data. Third, we illustrated how to analyze multiplexed paired-end FASTQ data using the q2-dada2 plugin. Fourth, we introduced Deblur and the q2-deblur plugin and illustrated how to use them to analyze demultiplexed paired-end FASTQ data, including initial quality filtering, the preliminary work for denoising with Deblur, and particularly denoising sequences with Deblur to construct a feature table and feature data and summarizing them. Chapter 5 will introduce how to assign taxonomy and build a phylogenetic tree. Chapter 6 will introduce the traditional OTU clustering methods.

References

Amir, Amnon, Daniel McDonald, Jose A. Navas-Molina, Evguenia Kopylova, James T. Morton, Zhenjiang Zech Xu, Eric P. Kightley, Luke R. Thompson, Embriette R. Hyde, Antonio Gonzalez, and Rob Knight. 2017. Deblur rapidly resolves single-nucleotide community sequence patterns. mSystems 2 (2): e00191-16. https://doi.org/10.1128/mSystems.00191-16. https://www.ncbi.nlm.nih.gov/pubmed/28289731. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340863/.
Bokulich, Nicholas A., Sathish Subramanian, Jeremiah J. Faith, Dirk Gevers, Jeffrey I. Gordon, Rob Knight, David A. Mills, and J. Gregory Caporaso. 2013. Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nature Methods 10 (1): 57–59. https://doi.org/10.1038/nmeth.2276.
Callahan, B.J., P.J. McMurdie, M.J. Rosen, A.W. Han, A.J. Johnson, and S.P. Holmes. 2016a. DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods 13 (7): 581–583. https://doi.org/10.1038/nmeth.3869. https://www.ncbi.nlm.nih.gov/pubmed/27214047. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4927377/.
Callahan, Ben J., Kris Sankaran, Julia A. Fukuyama, Paul J. McMurdie, and Susan P. Holmes. 2016b. Bioconductor Workflow for Microbiome Data Analysis: From raw reads to community analyses. F1000Research 5: 1492. https://doi.org/10.12688/f1000research.8986.2. https://www.ncbi.nlm.nih.gov/pubmed/27508062. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4955027/.
Callahan, Benjamin, Paul McMurdie, and Susan Holmes. 2017. Exact sequence variants should replace operational taxonomic units in marker-gene data analysis. The ISME Journal 11: 2639.
Caporaso, J.G., J. Kuczynski, J. Stombaugh, K. Bittinger, F.D. Bushman, E.K. Costello, N. Fierer, A.G. Peña, J.K. Goodrich, J.I. Gordon, G.A. Huttley, S.T. Kelley, D. Knights, J.E. Koenig, R.E. Ley, C.A. Lozupone, D. McDonald, B.D. Muegge, M. Pirrung, J. Reeder, J.R. Sevinsky, P.J. Turnbaugh, W.A. Walters, J. Widmann, T. Yatsunenko, J. Zaneveld, and R. Knight. 2010. QIIME allows analysis of high-throughput community sequencing data. Nature Methods 7 (5): 335–336. https://doi.org/10.1038/nmeth.f.303.
Edgar, Robert C., and Henrik Flyvbjerg. 2015. Error filtering, pair assembly and error correction for next-generation sequencing reads. Bioinformatics 31 (21): 3476–3482. https://doi.org/10.1093/bioinformatics/btv401.
Fadrosh, Douglas W., Bing Ma, Pawel Gajer, Naomi Sengamalay, Sandra Ott, Rebecca M. Brotman, and Jacques Ravel. 2014. An improved dual-indexing approach for multiplexed 16S rRNA gene sequencing on the Illumina MiSeq platform. Microbiome 2 (1): 6. https://doi.org/10.1186/2049-2618-2-6. https://www.ncbi.nlm.nih.gov/pubmed/24558975. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3940169/.
Hildebrand, Falk, Raul Tadeo, Anita Yvonne Voigt, Peer Bork, and Jeroen Raes. 2014. LotuS: An efficient and user-friendly OTU processing pipeline. Microbiome 2 (1): 30. https://doi.org/10.1186/2049-2618-2-30.
Kumar, Surendra, Tor Carlsen, Bjørn-Helge Mevik, Pål Enger, Rakel Blaalid, Kamran Shalchian-Tabrizi, and Håvard Kauserud. 2011. CLOTU: An online pipeline for processing and clustering of 454 amplicon reads into OTUs followed by taxonomic annotation. BMC Bioinformatics 12 (1): 182. https://doi.org/10.1186/1471-2105-12-182.
Mysara, Mohamed, Mercy Njima, Natalie Leys, Jeroen Raes, and Pieter Monsieurs. 2017. From reads to operational taxonomic units: An ensemble processing pipeline for MiSeq amplicon sequencing data. GigaScience 6 (2). https://doi.org/10.1093/gigascience/giw017.
Nearing, Jacob T., Gavin M. Douglas, André M. Comeau, and Morgan G.I. Langille. 2018. Denoising the Denoisers: An independent evaluation of microbiome sequence error-correction approaches. PeerJ 6: e5364. https://doi.org/10.7717/peerj.5364. https://www.ncbi.nlm.nih.gov/pubmed/30123705. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6087418/.
Neilson, Julia W., Katy Califf, Cesar Cardona, Audrey Copeland, Will Van Treuren, Karen L. Josephson, Rob Knight, Jack A. Gilbert, Jay Quade, and J. Gregory Caporaso. 2017. Significant impacts of increasing aridity on the arid soil microbiome. mSystems 2 (3): e00195-16.
Rideout, Jai Ram, John H. Chase, Evan Bolyen, Gail Ackermann, Antonio González, Rob Knight, and J. Gregory Caporaso. 2016. Keemei: Cloud-based validation of tabular bioinformatics file formats in Google Sheets. GigaScience 5 (1). https://doi.org/10.1186/s13742-016-0133-6.
Rosen, Michael J., Benjamin J. Callahan, Daniel S. Fisher, and Susan P. Holmes. 2012. Denoising PCR-amplified metagenome data. BMC Bioinformatics 13: 283. https://doi.org/10.1186/1471-2105-13-283. https://www.ncbi.nlm.nih.gov/pubmed/23113967. https://www.ncbi.nlm.nih.gov/pmc/PMC3563472/.
Schloss, Patrick D. 2020. Reintroducing mothur: 10 years later. Applied and Environmental Microbiology 86 (2): e02343-19.
Schloss, Patrick D., Alyxandria M. Schubert, Joseph P. Zackular, Kathryn D. Iverson, Vincent B. Young, and Joseph F. Petrosino. 2012. Stabilization of the murine gut microbiome following weaning. Gut Microbes 3 (4): 383–393. https://doi.org/10.4161/gmic.21008. https://www.ncbi.nlm.nih.gov/pubmed/22688727. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3463496/.
Xia, Yinglin, Jun Sun, and Ding-Geng Chen. 2018. Bioinformatic analysis of microbiome data. In Statistical analysis of microbiome data with R, 1–27. Singapore: Springer.